Team members: Long Zhang (47778605), Qinyu Chen (47813055), Zhengwen Zhang (47502277)
Our team selected the honey bee dataset from Kaggle (https://www.kaggle.com/jenny18/honey-bee-annotated-images). According to the description, the bee images were extracted from time-lapse videos with some cropping and background-removal steps; each image contains one or two bees (robber bees). We chose this dataset because:
a) It contains a large amount of image data (>5000 images), and the accompanying CSV file records metadata for each image, such as file name, date and time of capture, location, and health status.
b) The resolution of each image is appropriate (>20x20 pixels) and the total size of all images is modest (about 50 MB), making it less computationally intensive for machine learning studies.
c) From a learning perspective, this dataset is similar to the LFW dataset, so we can potentially apply knowledge from class to it.
d) The dataset has received good reviews on Kaggle and could be an excellent dataset for machine learning studies.
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import glob
from PIL import Image
import missingno as mn
import imageio
import warnings
warnings.filterwarnings('ignore')
bee = pd.read_csv("/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_data.csv")
#bee = pd.read_csv('/Users/terrence/Desktop/bee_data.csv')
bee.head()
Below are descriptions for each feature in the csv file:
File: File name in bee_imgs folder
Date: Date of video capture
Time: Time of day of video capture (military time)
Location: Location (city, state, country)
Zip Code: Zip code to numerically describe the location
Subspecies: Subspecies of Apis mellifera species
Health: Health description of bee
Pollen_Carrying: Presence of pollen on the bee's legs
Caste: Worker, Drone, or Queen bee
There are more than 20,000 species of bees; as of the early 21st century, only seven species of honey bees are recognized, with about 44 subspecies in total. Apis mellifera is truly domesticated for commercial pollination of fruits and flowers; the direct economic value of pollination is billions of dollars every year. Honey produced by bees is not only used as food but also plays a significant role in medicine: for example, it has been shown to help with wound healing, fighting cancer, anti-oxidant effects, and more. It will be interesting to learn what factors affect honey production by honeybees.
As we all know, honey is harvested from hives, and honey production is correlated with the overall status of the hive. Losses of up to 60% of honeybee hives have been observed over past winters in some US states. Beekeepers want to know whether hives are damaged so they can take action to reduce the risk. One direct way to check hive strength is simply to examine the inside of the hive; however, this is not ideal: it is time- and labor-consuming, the checkup is hard to quantify, and the process may disturb the bees' workflow and the hive itself. An interesting question is: is there a better way to assess the health status of hives? The answer is yes, if we can correlate hive health status with the bees' image features through machine learning.
The task for this lab is to predict the health of hives based on image features. In other words, we want to know whether a hive is healthy and, if not, which disease the hive is suffering from. Deciding whether to physically inspect a hive is a compromise: inspection takes time and usually disturbs the bees' workflow and the hive. A model like this provides convenience and economic benefit only if it is more accurate than manual decisions.
In this lab, we build our modeling from two main aspects: PCA and classification based on extracted features. Our final purpose is to classify groups of bees in different health conditions and predict health problems that may happen in the future. Why do we do PCA and feature selection?
In actual machine learning projects, feature selection / dimensionality reduction must be performed because there are several problems in the data:
1) Multicollinearity of the data: there are correlations between feature attributes. Multicollinearity leads to instability in the solution space, resulting in weak generalization ability of the model;
2) In high-dimensional space, samples are sparse, which makes it difficult for the model to find patterns in the data;
3) Too many variables hinder the model from finding rules;
4) Considering only the effect of a single variable on a target attribute may ignore potential relationships between variables.
The purpose of feature selection / dimensionality reduction is to reduce the number of feature attributes and ensure that the remaining attributes are independent of each other.
Citation:https://en.wikipedia.org/wiki/Principal_component_analysis
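As a small illustration of point 1) above, here is a sketch on synthetic, made-up data: two nearly collinear features collapse onto a single principal component, so one dimension carries almost all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# two nearly collinear features plus a little noise:
# almost all the variance lies along one direction
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.05 * rng.normal(size=200)])

pca = PCA(n_components=2)
pca.fit(data)
# the first component captures almost all of the variance
print(pca.explained_variance_ratio_)
```

With multicollinear features like these, keeping only the first component loses almost nothing, which is exactly the motivation for PCA on image pixels below.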
First of all, we need a good understanding of the data itself before the analysis. The first step is to read the images in as numpy arrays, resizing and recoloring them as necessary. Let's start by importing the bee description dataset.
honey_bee_df = pd.read_csv("/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_data.csv")
#honey_bee_df = pd.read_csv('/Users/terrence/Desktop/bee_data.csv')
#---------tony---------
# imgs_path = r"C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabTwo\Lab2_shared\honey-bee-annotated-images\bee_imgs\bee_imgs"
# data_path = r"C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabTwo\Lab2_shared\honey-bee-annotated-images"
# filename = r'bee_data.csv'
# bee = pd.read_csv(os.path.join(data_path,filename))
# honey_bee_df = pd.read_csv(os.path.join(data_path,filename))
#---------tony---------
#find the data types
print(bee.info())
There are 5172 rows in the CSV file, and this data frame shows that every column has 5172 non-null entries, so there are no missing values in these observations. The features are object, int64, or bool types. Since the name of each column is self-explanatory, we will not create a separate data description table.
print (bee.isnull().sum())
#checking missing values amount of each features in this dataset
This confirms that there are no missing values for any feature in the dataset.
Import the bee image dataset, convert the images from RGB to grayscale, and resize them. We define our target label ("health") for the prediction task.
# recolor images to make our analysis clearer
# according to the class slides, gray = 0.3 * r + 0.59 * g + 0.11 * b (luminosity method)
def gray(rgb):
    r, g, b = rgb[:,:,0], rgb[:,:,1], rgb[:,:,2]
    gray = 0.3 * r + 0.59 * g + 0.11 * b
    return gray
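As a quick sanity check on the luminosity conversion above (redefining the same formula locally so the snippet stands alone): the weights 0.3, 0.59, and 0.11 sum to 1, so a uniform-color image keeps its brightness and only the channel axis disappears.

```python
import numpy as np

def gray(rgb):
    # luminosity method: the weights 0.3/0.59/0.11 sum to 1,
    # so a uniform-color image keeps its brightness
    r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
    return 0.3 * r + 0.59 * g + 0.11 * b

# a 2x2 image where every pixel is mid-gray (128, 128, 128)
img = np.full((2, 2, 3), 128.0)
out = gray(img)
print(out.shape)  # the channel axis is gone: (2, 2)
```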
# transfer images to numpy arrays
x_data = []
image_array = []
image_target = []
#for im_path in glob.glob("/Users/terrence/Desktop/bee_imgs/bee_imgs/*.png"):
#for im_path in glob.glob(r"C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabTwo\Lab2_shared\honey-bee-annotated-images\bee_imgs\bee_imgs\*.png"):#tony
for im_path in glob.glob("/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_imgs/bee_imgs/*.png"):
    # read the images from our member's laptop path
    images = Image.open(im_path)
    images = images.resize((78, 78), Image.ANTIALIAS)
    images = gray(np.array(images))
    x_data.append(images)
    image_array.append(np.array(images)/255)
    # split the path to get the file name of each image
    target = bee[bee.file == im_path.split('/')[-1]].health.tolist()
    #target = bee[bee.file == im_path.split('\\')[-1]].health.tolist()  # Windows path separator
    image_target.append(target)
Check the pixels of one example (image #0) from the image dataset
# pixel of images/ here our pixel range is (0,1),because we divided all our original pixels by 255
from pandas.core.frame import DataFrame
table = DataFrame(image_array[0])
table
# check the health status
print(image_target[0])
This confirms the image_target array looks good.
# use two nested loops to "extend" and "append" row after row for each image
oneDim_arr = []
for img in image_array:
    each_img = []
    for row in img:
        each_img.extend(row)
    oneDim_arr.append(each_img)
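Incidentally, the nested loops above are equivalent to a single numpy reshape; a quick check on a made-up 3x3 "image":

```python
import numpy as np

img = np.arange(9, dtype=float).reshape(3, 3) / 255

# loop version, as above: extend row after row
flat_loop = []
for row in img:
    flat_loop.extend(row)

# vectorized equivalent
flat_np = img.reshape(-1)

print(np.array_equal(np.array(flat_loop), flat_np))  # True
```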
#print(oneDim_arr[0])
#convert them into numpy arrays
X = np.array(oneDim_arr)
y = np.array(image_target)
#show the number of samples and number of features
n_samples, n_features = np.shape(X)
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
table1 = DataFrame(X)
table1.head()
In this table the pixel range is (0, 1), because we divided all the original pixel values by 255.
IMAGE_PATH = '/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_imgs/bee_imgs/'
#IMAGE_PATH = '/Users/terrence/Desktop/bee_imgs/bee_imgs/'
#IMAGE_PATH = r'C:\Users\tony\Dropbox\smu\CSE7324\assignment\LabTwo\Lab2_shared\honey-bee-annotated-images\bee_imgs\bee_imgs' #tony
#citation: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification
tmp = honey_bee_df.groupby(['zip code'])['location'].value_counts()
df = pd.DataFrame(data={'Images': tmp.values}, index=tmp.index).reset_index()
df
From the table above, the second and third rows show the same city in the same state, so let's merge them.
#citation: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification
honey_bee_df = honey_bee_df.replace({'location':'Athens, Georgia, USA'}, 'Athens, GA, USA')
tmp = honey_bee_df.groupby(['zip code'])['location'].value_counts()
df = pd.DataFrame(data={'Images': tmp.values}, index=tmp.index).reset_index()
df['code'] = df['location'].map(lambda x: x.split(',', 2)[1])
df
So these images were taken in 7 cities across 6 states, covering the east coast (NH), west coast (CA), south (TX, LA, GA), and north (IA).
#citation: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
trace = go.Bar(
    x = df['location'],
    y = df['Images'],
    marker=dict(color="Tomato"),
    text=df['location']
)
data = [trace]
layout = dict(title = 'Number of bee images per location',
              xaxis = dict(title = 'Location', showticklabels=True, tickangle=15),
              yaxis = dict(title = 'Number of images'),
              hovermode = 'closest'
             )
fig = dict(data = data, layout = layout)
iplot(fig, filename='images-location')
Most of the images (2000) were captured in Saratoga, California. In second place is Athens, Georgia (1051), followed by Des Moines, Iowa (973).
locations = (honey_bee_df.groupby(['location'])['location'].nunique()).index
locations
#citation: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification
# here we define a function to plot images based on category features
def draw_category_images(var, cols=5):
    categories = (honey_bee_df.groupby([var])[var].nunique()).index
    f, ax = plt.subplots(nrows=len(categories), ncols=cols, figsize=(2*cols, 2*len(categories)))
    # draw a number of images for each category value
    for i, cat in enumerate(categories):
        sample = honey_bee_df[honey_bee_df[var]==cat].sample(cols)
        for j in range(0, cols):
            file = IMAGE_PATH + sample.iloc[j]['file']
            #file = os.path.join(imgs_path,'') + sample.iloc[j]['file'] #tony
            im = imageio.imread(file)
            ax[i, j].imshow(im, resample=True)
            ax[i, j].set_title(cat, fontsize=9)
    plt.tight_layout()
    plt.show()
Directly visualize some images according to the different health conditions
import matplotlib.pyplot as plt
draw_category_images("health")
The plot above shows 5 representative images for each health category: bees infected with varroa and small hive beetles, hives with ant problems, hives with a few varroa and hive beetles, healthy bees, hives being robbed, and hives with a missing queen. This gives us a nice glimpse of how bees should look in the various health conditions.
# each flattened array has 6084 values, so it is reshaped back into 78x78
h = 78
w = 78
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of images"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
#plot some bees (with their health labels) in gray scale
plot_gallery(X, y, h, w)
Let's visualize some images from the grayscale dataset.
import matplotlib.pyplot as plt # plt to show image
plt.style.use('ggplot')
fig = plt.figure(figsize=(20,10))
plt.subplot(1,4,1)  # image 3
plt.imshow(image_array[3], cmap = plt.get_cmap('gray'))
plt.subplot(1,4,2)  # image 44
plt.imshow(image_array[44], cmap = plt.get_cmap('gray'))
plt.subplot(1,4,3)  # image 55
plt.imshow(image_array[55], cmap = plt.get_cmap('gray'))
plt.subplot(1,4,4)  # image 36
plt.imshow(image_array[36], cmap = plt.get_cmap('gray'))
The objective of this section is to perform linear dimensionality reduction of the images using principal components analysis. Visualize the explained variance of each component and analyze how many dimensions are required to adequately represent the image data.
First, we use full PCA to perform linear dimensionality reduction and visualize the explained variance of each component; we choose 300 as the number of components.
#http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
from sklearn.decomposition import PCA
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
                ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
from sklearn.decomposition import PCA
n_components = 300
print ("Extracting the top %d eigenbees from %d bees" % (
    n_components, X.shape[0]))
pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
plot_explained_variance(pca)
The interactive plot above clearly demonstrates the relationship between the number of principal components and the explained variance ratio. At the knee of the plot, around 20 components, the cumulative explained variance ratio has already reached 0.84; it reaches 0.9 at about 56 components, which is a key point. Further increases do not raise the explained variance ratio significantly: the cumulative ratio plateaus, reaching a maximum of about 0.97 at 300 components. We will therefore use 300 as the number of components for full PCA. Compared with a larger number of components (>300), the captured variance is very close, but 300 components saves machine running time.
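The "knee" reasoning above can also be automated: given a fitted PCA, the smallest number of components reaching a target cumulative ratio is one line of numpy. Since the bee pixel matrix is not reproduced here, this sketch uses synthetic correlated data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# mix 50 features together so they are correlated
data = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

pca = PCA().fit(data)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# first component count (1-based) whose cumulative ratio reaches 90%
n_90 = int(np.searchsorted(cum_var, 0.90)) + 1
print(n_90, cum_var[n_90 - 1])
```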
# citation from: http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
from sklearn.decomposition import PCA
n_components = 300
print ("Extracting the top %d eigenbees from %d bees" % (
n_components, X.shape[0]))
pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
We visualize the eigenvectors ("eigenbees") of some bee images.
eigenface_titles = ["eigenbee %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
To evaluate how well full PCA works, we chose a random image that had been reduced by full PCA, inverse-transformed it to reconstruct the image, and then plotted the original and the reconstruction side by side to compare their similarity.
# citation from: http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
def reconstruct_image(trans_obj, org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
idx_to_reconstruct = int(np.random.rand() * len(X))  # pick a random image index
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca, X_idx.reshape(1, -1))
# citation from: http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid()
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Full PCA reconstructed')
plt.grid()
Interestingly, the original and full PCA reconstructed images are very similar through visual examination. Next, we will try another PCA.
We implemented another linear dimensionality reduction: randomized PCA, again with 300 components.
from sklearn.decomposition import PCA
n_components = 300
print ("Extracting the top %d eigenvectors from %d images" % (
n_components, X.shape[0]))
rpca = PCA(n_components=n_components,svd_solver='randomized')
%time rpca.fit(X.copy())
eigenfaces = rpca.components_.reshape((n_components, 78, 78))
We visualize the eigenvectors of some bee images.
eigenface_titles = ["bee %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, 78, 78)
We randomly choose an image, reduce and reconstruct it with randomized PCA, then compare it with the original to see how similar they are.
idx_to_reconstruct = int(np.random.rand() * len(X))  # pick a random image index
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(rpca,X_idx.reshape(1, -1))
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((78, 78)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid()
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((78, 78)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA')
plt.grid()
We used kernel PCA to perform non-linear dimensionality reduction, again choosing 300 as the number of components.
from sklearn.decomposition import KernelPCA
n_components = 300
print ("Extracting the top %d eigenbees from %d bees, not calculating inverse transform" %
(n_components, X.shape[0]))
kpca = KernelPCA(n_components=n_components, kernel='rbf',
fit_inverse_transform=True, gamma=12, # very sensitive to the gamma parameter,
remove_zero_eig=True)
kpca.fit(X.copy())
We again display the eigenvector gallery for reference. (Note: kernel PCA components live in the kernel feature space and cannot be reshaped into pixel images directly, so the gallery below still shows the randomized-PCA eigenvectors.)
eigenface_titles = ["bee %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
We took the same image, reduced and reconstructed it with kernel PCA, then compared it with the original to see how similar they are.
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image_kpca = reconstruct_image(kpca, X_idx.reshape(1, -1))
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid()
plt.subplot(1,2,2)
plt.imshow(reconstructed_image_kpca.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from KPCA')
plt.grid()
Obviously, the KPCA reconstructed image looks pretty bad compared to the original image.
Next we will compare the non-linear representation with the linear ones. The method we chose can quantitatively show which method is better at representing the images with fewer components. Mean-squared error may not be a good measurement for kPCA; we will explain why we prefer one method over another.
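Before turning to SSIM, one concrete way to quantify a linear method is reconstruction mean-squared error, which should shrink as the number of components grows. A sketch on synthetic data (not the bee images):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.normal(size=(100, 30)) @ rng.normal(size=(30, 30))  # correlated features

def reconstruction_mse(k):
    # fit PCA with k components, project, reconstruct, and measure the error
    pca = PCA(n_components=k).fit(data)
    rec = pca.inverse_transform(pca.transform(data))
    return np.mean((data - rec) ** 2)

errs = [reconstruction_mse(k) for k in (2, 10, 25)]
print(errs)  # decreasing as k grows
```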
from skimage.measure import compare_ssim as ssim
np.seterr(divide='ignore', invalid='ignore')
def BrightnessNormalization(img):
    # normalize so images with different brightness are comparable
    r = img / np.sqrt(np.sum((img**2), 0))
    return r
idx_to_compare = idx_to_reconstruct
reconstructed_image = pca.inverse_transform(pca.transform(X[idx_to_reconstruct].reshape(1, -1)))
original_brightness_norm = BrightnessNormalization(X[idx_to_reconstruct].reshape(h, w))
pca_brightness_norm = BrightnessNormalization(reconstructed_image.reshape(h, w))
structure_similarity_pca = ssim(original_brightness_norm, pca_brightness_norm,
                                data_range=pca_brightness_norm.max()-pca_brightness_norm.min())
print('Structure Similarity using PCA: %.2f' % structure_similarity_pca)
reconstructed_image_kpca = kpca.inverse_transform(kpca.transform(X[idx_to_reconstruct].reshape(1, -1)))
kpca_brightness_norm = BrightnessNormalization(reconstructed_image_kpca.reshape(h, w))
structure_similarity_kpca = ssim(original_brightness_norm, kpca_brightness_norm,
                                 data_range=kpca_brightness_norm.max()-kpca_brightness_norm.min())
print('Structure Similarity using KPCA: %.2f' % structure_similarity_kpca)
import warnings
from ipywidgets import widgets
from sklearn.decomposition import PCA
def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = int(round(idx_to_reconstruct))  # the slider may pass a float
    reconstructed_image = pca.inverse_transform(pca.transform(X[idx_to_reconstruct].reshape(1, -1)))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(X[idx_to_reconstruct].reshape(1, -1)))
    reconstructed_image_kpca = kpca.inverse_transform(kpca.transform(X[idx_to_reconstruct].reshape(1, -1)))
    plt.figure(figsize=(15,7))
    plt.subplot(1,4,1)
    plt.imshow(X[idx_to_reconstruct].reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Original')
    plt.grid(False)
    plt.subplot(1,4,2)
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Full PCA')
    plt.grid(False)
    plt.subplot(1,4,3)
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Randomized PCA')
    plt.grid(False)
    plt.subplot(1,4,4)
    plt.imshow(reconstructed_image_kpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Kernel PCA')
    plt.grid(False)
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,n_samples-1,1),__manual=True)
From the interactive plot, we can tell that for some images (e.g. image #3739) kernel PCA performs best, followed by randomized PCA and then full PCA. But for other images (e.g. image #4156) kernel PCA performs worst, with randomized PCA best, followed by full PCA.
The Structural Similarity Index (SSIM) is a perceptual metric that quantifies image quality degradation caused by processing such as data compression or by losses in data transmission. It is a full-reference metric: it requires two images from the same capture, a reference image and a processed image, and measures the perceptual difference between them. It cannot by itself judge which of the two is better; that must be inferred from knowing which is the original and which has been subjected to additional processing. The closer an image's score is to 1 (100%), the greater its similarity to the original image.
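For reference, the SSIM statistic can be written directly from its definition using means, variances, and covariance. This is a simplified single-window ("global") version, not the sliding-window implementation skimage uses, but it shows the behavior: identical images score 1, and an inverted image scores negative.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    # standard SSIM stabilizing constants
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.RandomState(0)
img = rng.rand(78, 78)
print(global_ssim(img, img))      # identical images score 1.0 (up to float error)
print(global_ssim(img, 1 - img))  # inverted image: negative covariance, negative score
```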
from skimage.measure import compare_ssim
import matplotlib.pyplot as plt
import numpy as np
import cv2
original2full = []
original2kpca = []
original2rpca = []
for index in range(0, 300):
    reconstructed_image = pca.inverse_transform(pca.transform(X[index].reshape(1, -1)))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(X[index].reshape(1, -1)))
    reconstructed_image_kpca = kpca.inverse_transform(kpca.transform(X[index].reshape(1, -1)))
    original = X[index].reshape((h, w))
    full_pca_img = reconstructed_image.reshape((h, w))
    rpca_img = reconstructed_image_rpca.reshape((h, w))
    kpca_img = reconstructed_image_kpca.reshape((h, w))
    score, diff = compare_ssim(original, full_pca_img, full=True)  # SSIM vs full PCA
    original2full.append(score)
    score, diff = compare_ssim(original, rpca_img, full=True)      # SSIM vs randomized PCA
    original2rpca.append(score)
    score, diff = compare_ssim(original, kpca_img, full=True)      # SSIM vs kernel PCA
    original2kpca.append(score)
print("success")
print("Full PCA compared Original image:",np.mean(original2full))
print("random PCA compared Original image:",np.mean(original2rpca))
print("Kernel PCA compared Original image:",np.mean(original2kpca))
We took the first 300 images, reconstructed each with full PCA, computed the SSIM score of each reconstruction, and averaged the scores; we did the same for randomized PCA and kernel PCA. The mean score for full PCA is very close to that of randomized PCA, and both are much higher than kernel PCA's. This makes sense: as shown before, kernel PCA's performance is unstable (sometimes very poor, sometimes extraordinarily good), while full PCA and randomized PCA are not always the best but are consistently stable.
Here we choose DAISY for feature extraction. At the end of this lab, we will explore other feature extraction techniques such as SIFT, SURF, and ORB.
from skimage.io import imshow
from skimage.feature import daisy
# let's first visualize what the daisy descriptor looks like
# (`images` still holds the last grayscale image from the loading loop)
features, img_desc = daisy(images/255,
step=50,
radius=10,
rings=3,
histograms=5,
orientations=8,
visualize=True)
imshow(img_desc)
plt.grid(False)
features = daisy(img, step=10, radius=10, rings=2, histograms=4, orientations=8, visualize=False)
print(features.shape)
print(features.shape[0]*features.shape[1]*features.shape[2])
# create a function to take in a row of the matrix and return a new feature vector
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=10, radius=10, rings=2, histograms=6, orientations=8, visualize=False)
    return feat.reshape((-1))
%time test_feature = apply_daisy(X[3],(h,w))
test_feature.shape
0.038 * len(X) # approximate how long it may run
# apply to entire data, row by row,
# takes about a minute to run
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
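A quick sketch of what `pairwise_distances` produces, on three made-up two-dimensional feature vectors: the result is a square, symmetric matrix with a zero diagonal, where entry (i, j) is the Euclidean distance between samples i and j.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# three toy "feature vectors"; rows 0 and 1 form a 3-4-5 triangle
feats = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = pairwise_distances(feats)  # Euclidean metric by default

print(d.shape)   # (3, 3): one row and column per sample
print(d[0, 1])   # 5.0
```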
from skimage.filters import gabor_kernel
from scipy import ndimage as ndi
from scipy import stats
# prepare filter bank kernels
kernels = []
for theta in range(4):
    theta = theta / 4. * np.pi
    for sigma in (1, 3):
        for frequency in (0.05, 0.25):
            kernel = np.real(gabor_kernel(frequency, theta=theta,
                                          sigma_x=sigma, sigma_y=sigma))
            kernels.append(kernel)
# compute the filter bank and take statistics of image
def compute_gabor(row, kernels, shape):
    feats = np.zeros((len(kernels), 4), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        # keep the mean, variance, skewness, and kurtosis of the filter response
        _, _, feats[k,0], feats[k,1], feats[k,2], feats[k,3] = stats.describe(filtered.reshape(-1))
    return feats.reshape(-1)
idx_to_reconstruct = int(np.random.rand() * len(X))  # pick a random image index
gabr_feature = compute_gabor(X[idx_to_reconstruct], kernels, (h,w))
gabr_feature
%time gabor_stats = np.apply_along_axis(compute_gabor, 1, X, kernels, (h,w))
print(gabor_stats.shape)
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix_gabor = pairwise_distances(gabor_stats)
From the running times above, Gabor is much slower than DAISY. Therefore, we choose DAISY as our main feature extraction technique for the rest of this work.
One interesting question is: does this feature extraction method show promise for our prediction task, and why? In the following, we use visualizations to explore this.
import copy
# find closest image to current image
idx1 = 1830 # 1830 is a bee sample whose health state is "few varroa, hive beetles"
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.infty # dont pick the same image!
idx2 = np.argmin(distances)
plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
imshow(X[idx1].reshape((h,w)))
plt.title("Original Image")
plt.grid()
plt.subplot(1,2,2)
imshow(X[idx2].reshape((h,w)))
plt.title("Closest Image")
plt.grid()
print("the row number of the closest image is:", idx2)
print("the health status of the closest image:", y[idx2])  # look up the label by image order, not CSV row
From the comparison above between the original image and its closest image, we can see that DAISY feature extraction works well for our bees' health prediction task: the nearest neighbor under DAISY features shares the same health label.
from sklearn import neighbors, datasets
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
#h = .12 # step size in the mesh
X = daisy_features  # note: X now holds DAISY features instead of raw pixels
X_training, X_test = X[0:1300,:], X[1300:3744,:]
y_training, y_test = y[0:1300], y[1300:3744]
clf = neighbors.KNeighborsClassifier(7, weights='distance')
clf.fit(X_training, y_training)
Z = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
percentage = accuracy * 100
print('Accuracy : ' + str(round(accuracy * 100, 2)) + '%')
In the code above, we divided the DAISY-processed image dataset into training and testing sets and used a k-nearest-neighbors classifier to evaluate the DAISY features. We got a final prediction accuracy of 69.8%, which is not a bad sign for our bee health prediction task.
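One caveat about the split above: slicing rows 0-1300 sequentially can bias both sets if the data happens to be ordered by health label. A shuffled, stratified split keeps the class proportions identical in train and test; here is a sketch on synthetic stand-in data (the names `X_toy`/`y_toy` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 10))
y_toy = rng.randint(0, 3, size=300)  # three stand-in health classes

# shuffled split that keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

clf = KNeighborsClassifier(7, weights='distance').fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # near chance here, since the toy labels are random
```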
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import StandardScaler
# Standarize the daisy features to get a better heat map later
std = StandardScaler()
# Find the pairwise distance between all the different image features
dist_matrix = pairwise_distances(std.fit_transform(daisy_features))
# Then we get an n_samples x n_samples pairwise distance matrix
print(dist_matrix.shape)
# Set the plotting style
sns.set(style="darkgrid")
# Set the size of the display figure
f, ax = plt.subplots(figsize=(8, 7))
# Use a heat map of the pairwise distances among all extracted features.
sns.heatmap(dist_matrix, annot=False)
A heatmap is a two-dimensional graphical representation of data where the individual values in a matrix are represented as colors. The seaborn package allows the creation of annotated heatmaps which can be tweaked with Matplotlib tools as needed. Our heat map looks fairly light and uniform, with distances ranging from about 60 to 120, so the images are broadly similar to each other in DAISY feature space.
Finally, we extract the features of two images using three methods (ORB, SIFT, and SURF) and compare them with each other. Then we choose the most appropriate method to continue our work.
#citation: https://stackoverflow.com/questions/31631352/typeerror-required-argument-outimg-pos-6-not-found
import numpy as np
import cv2
from matplotlib import pyplot as plt
img1 = cv2.imread("/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_imgs/bee_imgs/001_046.png",0)
img2 = cv2.imread("/Users/chenqinyu/Desktop/lab2/honey-bee-annotated-images/bee_imgs/bee_imgs/001_047.png",0)
# find the keypoints and descriptors with orb
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1,None)
kp2, des2 = orb.detectAndCompute(img2,None)
# create BFMatcher object
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
# Match descriptors.
matches = bf.match(des1,des2)
# Sort them in the order of their distance.
matches = sorted(matches, key = lambda x:x.distance)
# Draw the first 10 matches.
img3 = cv2.drawMatches(img1, kp1, img2, kp2, matches[:10], None, flags=2)
plt.imshow(img3),plt.show()
#citation: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_matcher/py_matcher.html
import numpy as np
import cv2
from matplotlib import pyplot as plt
# Initiate SIFT detector
sift = cv2.xfeatures2d.SIFT_create()
# find the keypoints and descriptors with SIFT
kp1, des1 = sift.detectAndCompute(img1,None)
kp2, des2 = sift.detectAndCompute(img2,None)
# FLANN parameters
FLANN_INDEX_KDTREE = 0
index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
search_params = dict(checks=50) # or pass empty dictionary
flann = cv2.FlannBasedMatcher(index_params,search_params)
matches = flann.knnMatch(des1,des2,k=2)
# Need to draw only good matches, so create a mask
matchesMask = [[0,0] for i in range(len(matches))]
# ratio test as per Lowe's paper
for i, (m, n) in enumerate(matches):
    if m.distance < 0.7*n.distance:
        matchesMask[i] = [1, 0]
draw_params = dict(matchColor = (0,255,0),
singlePointColor = (255,0,0),
matchesMask = matchesMask,
flags = 0)
img3 = cv2.drawMatchesKnn(img1,kp1,img2,kp2,matches,None,**draw_params)
plt.imshow(img3,),plt.show()
# Initiate Surf detector
surf = cv2.xfeatures2d.SURF_create()
# find the keypoints and descriptors with Surf
kp1, des1 = surf.detectAndCompute(img1,None)
kp2, des2 = surf.detectAndCompute(img2,None)
# FLANN parameters
FLANN_INDEX_KDTREE = 0
index_params = dict(algorithm = FLANN_INDEX_KDTREE, trees = 5)
search_params = dict(checks=50) # or pass empty dictionary
flann = cv2.FlannBasedMatcher(index_params,search_params)
matches = flann.knnMatch(des1,des2,k=2)
# Need to draw only good matches, so create a mask
matchesMask = [[0,0] for i in range(len(matches))]
# ratio test as per Lowe's paper
for i, (m, n) in enumerate(matches):
    if m.distance < 0.7*n.distance:
        matchesMask[i] = [1, 0]
draw_params = dict(matchColor = (0,255,0),
singlePointColor = (255,0,0),
matchesMask = matchesMask,
flags = 0)
img4 = cv2.drawMatchesKnn(img1,kp1,img2,kp2,matches,None,**draw_params)
plt.imshow(img4,),plt.show()
From our exploration above, we find that SIFT finds more accurate keypoints: it produces more (red) keypoints, and the green match lines show that SIFT captures many more features. Therefore, we choose SIFT to continue our next work.
Here we work with local keypoint descriptors. We build a list containing the descriptors of every image and then cluster them with k-means. (Note: although we preferred SIFT above, the code below calls the ORB detector, which yields 32-dimensional descriptors; the reshape of the cluster centers later assumes this size.)
from skimage import img_as_ubyte
from sklearn.cluster import KMeans
X_kp2 = []
X_des2 = []
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
for j in range(len(image_array)):
    img_t = image_array[j]
    img2 = img_as_ubyte(img_t)
    # ORB descriptors are 32-dimensional; the reshape below relies on this
    kp2, des2 = orb.detectAndCompute(img2, None)
    X_kp2.append(kp2)
    X_des2.append(des2)
kmeans = KMeans(n_clusters = 4)
kmeans.fit(np.array(X_des2[0]))
x=np.array(X_des2[0])
We use the descriptors of the first image to run k-means clustering.
wcss = []
for i in range(1, 7):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(np.array(X_des2[0]))
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 7), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
We can see that number of clusters = 3 is a turning point (the elbow), so we choose 3 as the number of clusters.
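The same elbow heuristic can be checked on synthetic, well-separated blobs (a sketch with made-up data, not the bee descriptors): the within-cluster sum of squares (WCSS, i.e. `inertia_`) drops sharply up to the true number of clusters and flattens afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated blobs: the inertia curve should elbow at k=3
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs).inertia_
        for k in range(1, 7)]
print(wcss)  # non-increasing, with a sharp drop before k=3
```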
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(x)
plt.scatter(x[:,0], x[:,1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()
Classify the descriptor features into three groups; the red points are the centers of each group.
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(x)
kmeans.cluster_centers_.shape
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(6, 4, 4)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
From the pictures above, it is hard to see a bee. We think the reason is resolution: each plotted tile is only 4x4, while the original pictures are 78x78. Why 4x4? The k-means input here is keypoint descriptors, not pixels, so the spatial layout of the original image is lost; each of the three 32-dimensional cluster centers is simply reshaped into two 4x4 tiles for display, and those tiles no longer resemble the original image.